
Scientific Publications of the University of Toulouse II Le Mirail

Graph-Based ETL Processes For Warehousing Statistical Open Data

Author: Berro Alain
Megdiche-Bousarsar Imen
Teste Olivier
Publication venue: 'Scitepress'
Publication date: 01/01/2015

ICEIS 2015 will be held in conjunction with ENASE 2015 and GISTAM 2015. International audience. Warehousing is a promising means to cross and analyse Statistical Open Data (SOD). However, extracting structures, integrating, and defining a multidimensional schema from several scattered and heterogeneous SOD tables are major problems that challenge traditional ETL (Extract-Transform-Load) processes. In this paper, we present a three-step ETL process that relies on RDF graphs to address these problems. In the first step, we automatically extract table structures and values using a table anatomy ontology. This phase converts structurally heterogeneous tables into a unified RDF graph representation. The second step performs a holistic integration of several semantically heterogeneous RDF graphs. The optimal integration is computed by an Integer Linear Program (ILP). In the third step, the system interacts with users to incrementally transform the integrated RDF graph into a multidimensional schema.

Scientific Publications of the University of Toulouse II Le Mirail

A Linear Program For Holistic Matching : Assessment on Schema Matching Benchmark

Author: Berro Alain
Megdiche-Bousarsar Imen
Teste Olivier
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

International audience. Schema matching is a key task in several applications such as data integration and ontology engineering. All application fields require the matching of several schemas, also known as "holistic matching", but the difficulty of the problem has drawn much more attention to pairwise schema matching than to holistic matching. In this paper, we propose a new approach for holistic matching. We suggest modelling the problem with techniques borrowed from the combinatorial optimization field. We propose a linear program, named LP4HM, which extends the maximum-weighted graph matching problem with different linear constraints. These constraints encompass matching setup constraints, especially cardinality and threshold constraints, and schema structural constraints, especially superclass/subclass and coherence constraints. The matching quality of LP4HM is evaluated on a recent benchmark dedicated to assessing schema matching tools. Experiments show competitive results compared to other tools, in particular for recall and the HSR quality measure.
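
As a rough illustration of the kind of model this abstract describes, the classical maximum-weighted bipartite matching problem with cardinality and threshold constraints can be written as the 0-1 program below. This is a textbook-style sketch, not the LP4HM formulation itself: the symbols $s_{ij}$ (similarity between element $i$ of one schema and element $j$ of another), $x_{ij}$ (decision variable) and $\theta$ (similarity threshold) are assumptions introduced here, and the structural constraints mentioned in the abstract are omitted.

```latex
\begin{aligned}
\max \quad & \sum_{i}\sum_{j} s_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{j} x_{ij} \le 1 \quad \forall i
  && \text{(cardinality: each } i \text{ matched at most once)} \\
& \sum_{i} x_{ij} \le 1 \quad \forall j
  && \text{(cardinality: each } j \text{ matched at most once)} \\
& (s_{ij} - \theta)\, x_{ij} \ge 0 \quad \forall i, j
  && \text{(threshold: no match below } \theta\text{)} \\
& x_{ij} \in \{0, 1\} \quad \forall i, j
\end{aligned}
```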

Scientific Publications of the University of Toulouse II Le Mirail

Holistic Statistical Open Data Integration Based On Integer Linear Programming

Author: Berro Alain
Megdiche-Bousarsar Imen
Teste Olivier
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015

International audience. Integrating several Statistical Open Data (SOD) tables is a very promising research issue. Various analysis scenarios are hidden behind these statistical data, which makes it important to have a holistic view of them. However, as these data are scattered across several tables, using existing pairwise schema matching approaches to integrate the schemas of these tables is a slow and costly process. Hence, we need automatic tools that rapidly converge to a holistic integrated view of the data and give good matching quality. To accomplish this objective, we propose a new 0-1 linear program that automatically solves the problem of holistic OD integration. It computes globally optimal solutions that maximize the profit of similarities between OD graphs. The program encompasses different constraints related to graph structures and matching setup, in particular 1:1 matching. It is solved using a standard solver (CPLEX), and experiments show that it can handle several input graphs and achieves good matching quality compared to existing tools.
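
The idea of maximizing total similarity under a 1:1 matching constraint can be sketched on a toy example. The snippet below is only an illustration under stated assumptions: it handles two small node sets rather than several graphs, it enumerates all 0-1 assignments by brute force instead of calling CPLEX, and the function name `best_matching`, the similarity matrix and the `threshold` parameter are hypothetical, not taken from the paper.

```python
from itertools import product


def best_matching(sim, threshold):
    """Brute-force search for the maximum-similarity 1:1 matching.

    sim[i][j] is an assumed similarity between node i of one graph and
    node j of another. A pair may be selected only if its similarity is
    at least `threshold`, and every node is matched at most once.
    """
    n_rows = len(sim)
    n_cols = len(sim[0]) if sim else 0
    # Each row node either stays unmatched (None) or picks one column node.
    options = [None] + list(range(n_cols))
    best_score, best_pairs = 0.0, []
    for choice in product(options, repeat=n_rows):
        used = [j for j in choice if j is not None]
        if len(used) != len(set(used)):
            continue  # a column node was used twice: 1:1 constraint violated
        pairs = [(i, j) for i, j in enumerate(choice) if j is not None]
        if any(sim[i][j] < threshold for i, j in pairs):
            continue  # threshold constraint violated
        score = sum(sim[i][j] for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs


# Toy similarity matrix (made-up values for illustration only).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.2, 0.1, 0.3],
]
score, pairs = best_matching(sim, threshold=0.5)
print(pairs)
```

Exhaustive enumeration grows exponentially with the number of nodes, which is why a dedicated solver is the practical choice for inputs of realistic size.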

Crossref

Scientific Publications of the University of Toulouse II Le Mirail

Transformer les Open Data brutes en graphes enrichis en vue d'une intégration dans les systèmes OLAP

Author: Berro Alain
Megdiche-Bousarsar Imen
Teste Olivier
Publication venue: HAL CCSD
Publication date: 01/01/2014

National audience. Integrating Open Data into OLAP systems is difficult because of the absence of source schemas, the raw nature of the data, and their semantic and structural heterogeneity. Most existing work addresses Open Data in RDF format, although only a small share of the available data is published in that format. By contrast, few works address Open Data in raw formats such as Excel, even though these represent more than 90% of the available open data. In this paper, we propose an automatic process that transforms raw Open Data into enriched, exploitable graphs for integration. The process is validated by the user and is part of our generic approach for integrating Open Data into multidimensional data warehouses.

Scientific Publications of the University of Toulouse II Le Mirail

Simuler les données manquantes dans les Open Data (INFORSID 2015 - Atelier Impact des data dans les systèmes d'information, Lyon, 20/05/15-23/05/15)

Author: Megdiche-Bousarsar Imen
Publication venue: HAL CCSD
Publication date: 01/01/2015

Workshop. International audience.

Holistic integration and automatic warehousing of open data

Author: Megdiche Bousarsar Imen
Publication venue
Publication date: 10/12/2015

Statistical Open Data provide useful information for feeding a decision-support system. They are integrated and stored in such systems through ETL processes. These processes need to be automated so that they are accessible to non-experts. They must also cope with the lack of schemas and the structural and semantic heterogeneity that characterize Open Data. To address these issues, we propose a new graph-based ETL approach. For extraction, we propose automatic detection and annotation activities, based on a table model, that extract a graph from a table. For transformation, we propose a linear program that solves the holistic matching of structural data from several graphs. This model yields an optimal and unique solution. For loading, we propose a progressive process for defining the multidimensional schema and augmenting the integrated graph. Finally, we present a prototype and the results of experimental evaluations.

Theses.fr

Entreposage d'Open Data : ODET

Author: Megdiche-Bousarsar Imen
Publication venue: HAL CCSD
Publication date: 01/01/2014

Innovation IT Day, Digital Place. International audience.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

Vers l'intégration multidimensionnelle d'Open Data dans les entrepôts de données

Author: Berro Alain
Megdiche-Bousarsar Imen
Teste Olivier
Publication venue: Revue des Nouvelles Technologies de l'Information (RNTI)
Publication date: 01/01/2013

The emergence of numerous Open Data sources is driving several research communities and companies to develop tools for exploiting them. In particular, the statistical data found in Open Data can provide useful information for decision-support analyses. However, Open Data are highly heterogeneous and scattered across the web in many fragments, which makes them difficult to integrate into a data warehouse. Current work on Open Data integration proposes integration processes based on Linked Open Data, whose setup is not automated. In this article, we propose a process that aims to automate the multidimensional warehousing of Open Data. Our approach relies on transforming Open Data into a generic, enriched graph that facilitates their integration. This graph serves as the basis for the semi-automatic and incremental definition of the multidimensional warehouse schema.

Scientific Publications of the University of Toulouse II Le Mirail